{
  "cells": [
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "%matplotlib inline"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "\n# Compare over-sampling samplers\n\nThe following example attends to make a qualitative comparison between the\ndifferent over-sampling algorithms available in the imbalanced-learn package.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>\n# License: MIT"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "print(__doc__)\n\nimport matplotlib.pyplot as plt\nimport seaborn as sns\n\nsns.set_context(\"poster\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The following function will be used to create toy dataset. It uses the\n:func:`~sklearn.datasets.make_classification` from scikit-learn but fixing\nsome parameters.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from sklearn.datasets import make_classification\n\n\ndef create_dataset(\n    n_samples=1000,\n    weights=(0.01, 0.01, 0.98),\n    n_classes=3,\n    class_sep=0.8,\n    n_clusters=1,\n):\n    return make_classification(\n        n_samples=n_samples,\n        n_features=2,\n        n_informative=2,\n        n_redundant=0,\n        n_repeated=0,\n        n_classes=n_classes,\n        n_clusters_per_class=n_clusters,\n        weights=list(weights),\n        class_sep=class_sep,\n        random_state=0,\n    )"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The following function will be used to plot the sample space after resampling\nto illustrate the specificities of an algorithm.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "def plot_resampling(X, y, sampler, ax, title=None):\n    X_res, y_res = sampler.fit_resample(X, y)\n    ax.scatter(X_res[:, 0], X_res[:, 1], c=y_res, alpha=0.8, edgecolor=\"k\")\n    if title is None:\n        title = f\"Resampling with {sampler.__class__.__name__}\"\n    ax.set_title(title)\n    sns.despine(ax=ax, offset=10)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The following function will be used to plot the decision function of a\nclassifier given some data.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import numpy as np\n\n\ndef plot_decision_function(X, y, clf, ax, title=None):\n    plot_step = 0.02\n    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1\n    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1\n    xx, yy = np.meshgrid(\n        np.arange(x_min, x_max, plot_step), np.arange(y_min, y_max, plot_step)\n    )\n\n    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])\n    Z = Z.reshape(xx.shape)\n    ax.contourf(xx, yy, Z, alpha=0.4)\n    ax.scatter(X[:, 0], X[:, 1], alpha=0.8, c=y, edgecolor=\"k\")\n    if title is not None:\n        ax.set_title(title)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Illustration of the influence of the balancing ratio\n\nWe will first illustrate the influence of the balancing ratio on some toy\ndata using a logistic regression classifier which is a linear model.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from sklearn.linear_model import LogisticRegression\n\nclf = LogisticRegression()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We will fit and show the decision boundary model to illustrate the impact of\ndealing with imbalanced classes.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "fig, axs = plt.subplots(nrows=2, ncols=2, figsize=(15, 12))\n\nweights_arr = (\n    (0.01, 0.01, 0.98),\n    (0.01, 0.05, 0.94),\n    (0.2, 0.1, 0.7),\n    (0.33, 0.33, 0.33),\n)\nfor ax, weights in zip(axs.ravel(), weights_arr):\n    X, y = create_dataset(n_samples=300, weights=weights)\n    clf.fit(X, y)\n    plot_decision_function(X, y, clf, ax, title=f\"weight={weights}\")\n    fig.suptitle(f\"Decision function of {clf.__class__.__name__}\")\nfig.tight_layout()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Greater is the difference between the number of samples in each class, poorer\nare the classification results.\n\n## Random over-sampling to balance the data set\n\nRandom over-sampling can be used to repeat some samples and balance the\nnumber of samples between the dataset. It can be seen that with this trivial\napproach the boundary decision is already less biased toward the majority\nclass. The class :class:`~imblearn.over_sampling.RandomOverSampler`\nimplements such of a strategy.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from imblearn.pipeline import make_pipeline\nfrom imblearn.over_sampling import RandomOverSampler\n\nX, y = create_dataset(n_samples=100, weights=(0.05, 0.25, 0.7))\n\nfig, axs = plt.subplots(nrows=1, ncols=2, figsize=(15, 7))\n\nclf.fit(X, y)\nplot_decision_function(X, y, clf, axs[0], title=\"Without resampling\")\n\nsampler = RandomOverSampler(random_state=0)\nmodel = make_pipeline(sampler, clf).fit(X, y)\nplot_decision_function(X, y, model, axs[1], f\"Using {model[0].__class__.__name__}\")\n\nfig.suptitle(f\"Decision function of {clf.__class__.__name__}\")\nfig.tight_layout()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "By default, random over-sampling generates a bootstrap. The parameter\n`shrinkage` allows adding a small perturbation to the generated data\nto generate a smoothed bootstrap instead. The plot below shows the difference\nbetween the two data generation strategies.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(15, 7))\n\nsampler.set_params(shrinkage=None)\nplot_resampling(X, y, sampler, ax=axs[0], title=\"Normal bootstrap\")\n\nsampler.set_params(shrinkage=0.3)\nplot_resampling(X, y, sampler, ax=axs[1], title=\"Smoothed bootstrap\")\n\nfig.suptitle(f\"Resampling with {sampler.__class__.__name__}\")\nfig.tight_layout()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "It looks like more samples are generated with smoothed bootstrap. This is due\nto the fact that the samples generated are not superimposing with the\noriginal samples.\n\n## More advanced over-sampling using ADASYN and SMOTE\n\nInstead of repeating the same samples when over-sampling or perturbating the\ngenerated bootstrap samples, one can use some specific heuristic instead.\n:class:`~imblearn.over_sampling.ADASYN` and\n:class:`~imblearn.over_sampling.SMOTE` can be used in this case.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from imblearn import FunctionSampler  # to use a idendity sampler\nfrom imblearn.over_sampling import SMOTE, ADASYN\n\nX, y = create_dataset(n_samples=150, weights=(0.1, 0.2, 0.7))\n\nfig, axs = plt.subplots(nrows=2, ncols=2, figsize=(15, 15))\n\nsamplers = [\n    FunctionSampler(),\n    RandomOverSampler(random_state=0),\n    SMOTE(random_state=0),\n    ADASYN(random_state=0),\n]\n\nfor ax, sampler in zip(axs.ravel(), samplers):\n    title = \"Original dataset\" if isinstance(sampler, FunctionSampler) else None\n    plot_resampling(X, y, sampler, ax, title=title)\nfig.tight_layout()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The following plot illustrates the difference between\n:class:`~imblearn.over_sampling.ADASYN` and\n:class:`~imblearn.over_sampling.SMOTE`.\n:class:`~imblearn.over_sampling.ADASYN` will focus on the samples which are\ndifficult to classify with a nearest-neighbors rule while regular\n:class:`~imblearn.over_sampling.SMOTE` will not make any distinction.\nTherefore, the decision function depending of the algorithm.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "X, y = create_dataset(n_samples=150, weights=(0.05, 0.25, 0.7))\n\nfig, axs = plt.subplots(nrows=1, ncols=3, figsize=(20, 6))\n\nmodels = {\n    \"Without sampler\": clf,\n    \"ADASYN sampler\": make_pipeline(ADASYN(random_state=0), clf),\n    \"SMOTE sampler\": make_pipeline(SMOTE(random_state=0), clf),\n}\n\nfor ax, (title, model) in zip(axs, models.items()):\n    model.fit(X, y)\n    plot_decision_function(X, y, model, ax=ax, title=title)\n\nfig.suptitle(f\"Decision function using a {clf.__class__.__name__}\")\nfig.tight_layout()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Due to those sampling particularities, it can give rise to some specific\nissues as illustrated below.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "X, y = create_dataset(n_samples=5000, weights=(0.01, 0.05, 0.94), class_sep=0.8)\n\nsamplers = [SMOTE(random_state=0), ADASYN(random_state=0)]\n\nfig, axs = plt.subplots(nrows=2, ncols=2, figsize=(15, 15))\nfor ax, sampler in zip(axs, samplers):\n    model = make_pipeline(sampler, clf).fit(X, y)\n    plot_decision_function(\n        X, y, clf, ax[0], title=f\"Decision function with {sampler.__class__.__name__}\"\n    )\n    plot_resampling(X, y, sampler, ax[1])\n\nfig.suptitle(\"Particularities of over-sampling with SMOTE and ADASYN\")\nfig.tight_layout()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "SMOTE proposes several variants by identifying specific samples to consider\nduring the resampling. The borderline version\n(:class:`~imblearn.over_sampling.BorderlineSMOTE`) will detect which point to\nselect which are in the border between two classes. The SVM version\n(:class:`~imblearn.over_sampling.SVMSMOTE`) will use the support vectors\nfound using an SVM algorithm to create new sample while the KMeans version\n(:class:`~imblearn.over_sampling.KMeansSMOTE`) will make a clustering before\nto generate samples in each cluster independently depending each cluster\ndensity.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from imblearn.over_sampling import BorderlineSMOTE, KMeansSMOTE, SVMSMOTE\n\nX, y = create_dataset(n_samples=5000, weights=(0.01, 0.05, 0.94), class_sep=0.8)\n\nfig, axs = plt.subplots(5, 2, figsize=(15, 30))\n\nsamplers = [\n    SMOTE(random_state=0),\n    BorderlineSMOTE(random_state=0, kind=\"borderline-1\"),\n    BorderlineSMOTE(random_state=0, kind=\"borderline-2\"),\n    KMeansSMOTE(random_state=0),\n    SVMSMOTE(random_state=0),\n]\n\nfor ax, sampler in zip(axs, samplers):\n    model = make_pipeline(sampler, clf).fit(X, y)\n    plot_decision_function(\n        X, y, clf, ax[0], title=f\"Decision function for {sampler.__class__.__name__}\"\n    )\n    plot_resampling(X, y, sampler, ax[1])\n\nfig.suptitle(\"Decision function and resampling using SMOTE variants\")\nfig.tight_layout()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "When dealing with a mixed of continuous and categorical features,\n:class:`~imblearn.over_sampling.SMOTENC` is the only method which can handle\nthis case.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from collections import Counter\nfrom imblearn.over_sampling import SMOTENC\n\nrng = np.random.RandomState(42)\nn_samples = 50\n# Create a dataset of a mix of numerical and categorical data\nX = np.empty((n_samples, 3), dtype=object)\nX[:, 0] = rng.choice([\"A\", \"B\", \"C\"], size=n_samples).astype(object)\nX[:, 1] = rng.randn(n_samples)\nX[:, 2] = rng.randint(3, size=n_samples)\ny = np.array([0] * 20 + [1] * 30)\n\nprint(\"The original imbalanced dataset\")\nprint(sorted(Counter(y).items()))\nprint()\nprint(\"The first and last columns are containing categorical features:\")\nprint(X[:5])\nprint()\n\nsmote_nc = SMOTENC(categorical_features=[0, 2], random_state=0)\nX_resampled, y_resampled = smote_nc.fit_resample(X, y)\nprint(\"Dataset after resampling:\")\nprint(sorted(Counter(y_resampled).items()))\nprint()\nprint(\"SMOTE-NC will generate categories for the categorical features:\")\nprint(X_resampled[-5:])\nprint()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "However, if the dataset is composed of only categorical features then one\nshould use :class:`~imblearn.over_sampling.SMOTEN`.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from imblearn.over_sampling import SMOTEN\n\n# Generate only categorical data\nX = np.array([\"A\"] * 10 + [\"B\"] * 20 + [\"C\"] * 30, dtype=object).reshape(-1, 1)\ny = np.array([0] * 20 + [1] * 40, dtype=np.int32)\n\nprint(f\"Original class counts: {Counter(y)}\")\nprint()\nprint(X[:5])\nprint()\n\nsampler = SMOTEN(random_state=0)\nX_res, y_res = sampler.fit_resample(X, y)\nprint(f\"Class counts after resampling {Counter(y_res)}\")\nprint()\nprint(X_res[-5:])\nprint()"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.10.4"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}